Scope

This document aims to perform clustering on scRNAseq data already passed QC. This analysis is first performed without samples integration. Obtained results will be used to assess if integration is required or not**.

SPOILER: No samples integration is required since all of them perfectly overlap.

It is out of the scope to annotate cell types in this dataset, however cluster markers are obtained for further consideration.

Primary data

A total of six samples were library-prepared and sequenced by NovoGene Co, Ltd. Libraries were prepared using 10x Single Cell 3’v3 kit. Corresponding biological samples were delivered to IJC at the end of October’23 and sequencing data was received at the end of February’24. Samples are equally divided in two conditions, considering cells from mouse embryoid bodies (mEBs 144h) :

  • in wild-type (WT) condition (3x) and,
  • after activation of 7* specific genes (7g) (3x) identified by CRISPRa in an earlier project phase.

*Genes related to HSC development fate.

Methods

Following criteria is applied:

  • SCTransform-based normalization (v2) regressing out mitochondrial content. Rest of parameters by default.
  • Dimensionality reduction considering 35 PCs which corresponds to a total cumulative variance explained of 85%.
  • Clustering (Louvain algorithm) with resolutions explored: from 0.2 to 0.4 (less to more specific)
  • FindConservedMarkers was used to identify differentially expressed genes among clusters. By default, the non-parametric Wilcoxon rank sum test is applied. Other default parameters were considered except for logFC threshold increased to 0.5 (default is 0.1) and minimum pct to 0.25 (default is 0.01) which refers to only test genes that are detected in this minimum fraction of cells (in either of the two populations under test). Only positive genes are considered.

All previous steps are performed by means of Seurat R package (Hao et al. 2023) (v5.0.0) together with glmGamPoi R package (Ahlmann-Eltze and Huber 2021) (v.1.12.2).

Results

As a reminder, the dataset to be analyzed contains 21894 features (genes) over 34950 cells distributed in 6 samples from two conditions. In this case, ribosomal genes were excluded from the dataset. Low quality cells were also discarded as a result of a previous QC.

Normalization and dimensionality reduction (PCA, UMAP)

Normalization aims to remove technical factors such as the library size that may confound the real biological heterogeneity. In this case, during normalization, mitochondrial content is also removed as a possible source of variation in our samples.

Once data is normalized, next step is to proceed with dimensionality reduction, this is required for visualization purposes and ease clustering. Dimensionality reduction is completed using two methods: first PCA and secondly, UMAP, considering a specific number of PCs from PCA.

In order to choose a proper number of PCs to be consider for downstream analysis, those genes defining PCs can be checked as following indicated for the first 10 PCs (limited to 10 genes):

## PC_ 1 
## Positive:  Myl7, Acta2, Tnnt2, Ttn, Tpm1, Cdkn1c, Myl4, Tmsb4x, Myl3, Vim 
## Negative:  Dppa5a, Hbb-bh1, Mt1, Hba-x, Hba-a1, Mkrn1, Chchd10, L1td1, Mt2, Trh 
## PC_ 2 
## Positive:  Hbb-bh1, Hba-x, Hba-a1, Hba-a2, Hbb-bs, Hbb-y, Slc25a21, Reln, Fth1, Gypa 
## Negative:  Dppa5a, Mt1, Mkrn1, Trh, L1td1, Pou5f1, Chchd10, Tdh, Mt2, Zfp42 
## PC_ 3 
## Positive:  Myl7, Tnnt2, Ttn, Myl4, Myl3, Slc8a1, Acta2, Actc1, Dppa5a, Cdkn1c 
## Negative:  Ctla2a, Vim, Krt8, Krt18, Peg3, Tmsb4x, Ramp2, S100a10, Spink1, Fn1 
## PC_ 4 
## Positive:  Spink1, Krt18, Krt8, Emb, Tagln, Slc2a3, Car4, Tpm1, Hba-x, Ttr 
## Negative:  Ctla2a, Ramp2, Hapln1, Vim, Fli1, Cdh5, Ctla2b, Ptprm, Rhoj, Egfl7 
## PC_ 5 
## Positive:  Phlda2, Hand1, Pmp22, Unc5c, Prdm6, Pcsk5, Mesp1, Cdh11, Gpc6, Gpc3 
## Negative:  Ctla2a, Spink1, Tmsb4x, Emb, Ramp2, Krt18, Flt1, Slc2a3, Apoe, Malat1 
## PC_ 6 
## Positive:  Acta2, Tagln, Tmsb4x, Tpm1, Krt8, Ctla2a, Krt18, Vim, Hand1, Ahnak 
## Negative:  Malat1, Prkg1, Lsamp, Gm42418, Gpc6, Auts2, Gpc3, Meis2, Maml3, Lhx1 
## PC_ 7 
## Positive:  Tmsb4x, Acta2, Tagln, Dppa5a, Igfbp5, Igf2, Airn, Csrp2, Ahnak, Tpm1 
## Negative:  Spink1, Myl7, Emb, Fgf8, Tnnt2, Ttr, S100a10, Car2, Myl4, Mixl1 
## PC_ 8 
## Positive:  Tagln, Fgf8, T, Tmsb4x, Fst, S100a6, Sprr2a3, Lefty1, Psmb8, Sfn 
## Negative:  Spink1, Ttr, Unc5c, Dppa5a, Emb, Car4, Fth1, Hand1, Airn, Rbp4 
## PC_ 9 
## Positive:  Dppa5a, Hba-x, Hba-a1, S100a10, Gpc3, Ifitm1, Pcdh7, Phlda2, Pcsk5, Peg3 
## Negative:  Igfbp5, Reln, Ccnd2, Prkar2b, Fgf3, Pax9, Vgll2, Fth1, Lmo2, Prkca 
## PC_ 10 
## Positive:  Dppa5a, Igfbp5, Tmsb4x, Lsamp, Lhx1, Emb, Tmsb10, Ctla2a, Gpc3, Meis2 
## Negative:  Malat1, Unc5c, 2410006H16Rik, Gm42418, Ftl1-ps1, Magi2, Exoc4, Anks1b, Ddit3, Ankrd12

Another useful plot is the Elbow plot which ranks PCs based on their percentage of variance explained by each one.

An elbow is observed (approx) between 25-35 PCs, suggesting that the majority of true signal is captured in those first PCs. On the other hand, to achieve a cumulative percentage of captured variation greater than 85%, 38 PCs are required. Based on this, a total of 38 are considered for running UMAP.

The resulting UMAP layout distinguishing by samples condition is:

This representation shows a great overlapping between conditions, this can be interpreted as a good alignment among samples. Thus, it is concluded that there is no need for samples integration. If this plot is generated per sample, the same conclusion is kept:

It is recommended to visualize UMAP against other possible confounding variables i.e. mitochondrial content or number of genes detected per cell. This is shown in next figures:

Finally, cell phase can be inferred from a reference set of genes defined by (Tirosh et al. 2016) and phase labels represented in UMAP, as shown in following figure:

Since this dataset includes developing cells, it is not recommended to regress out this information.

Custering

Different clustering resolutions are explored from lower (less specific cell type) to higher granularity (more specific cell type).

To view how clusters sub-divide as resolution increases, typically a clustering tree is visualized. Following figure represents, at each row, one specific clustering resolution (from 0.1 to 0.8 included). As it is interpreted, first level of clustering (resolution 0.1) seems to be too general while above 0.4 resolution clusters are quite stable.

Clusters can be represented over UMAP, following graphs show this layout for clustering resolutions 0.3 and 0.4 (which seem to show certain stability)

For now, a clustering resolution of 0.4 is selected, the main difference with previous resolution is regarding clusters 4 and 6. The distribution of cells among clusters and between group is:

##     
##      g7_s1 g7_s2 g7_s3 WT_s1 WT_s2 WT_s3
##   0   1234  1140   919  1117  1115  1334
##   1    789   743   659  1031   880  1046
##   2    480   442   450   624   552   682
##   3    530   433   410   568   466   579
##   4    605   547   510   409   384   440
##   5    547   482   485   437   430   438
##   6    549   497   426   358   333   424
##   7    343   308   263   323   351   402
##   8    283   242   215   333   310   348
##   9    275   261   233   203   209   237
##   10   213   132   150   154   119   167
##   11   192   154   125   150   101   121
##   12   108   119    92   158   160   173
##   13   110    93   122    56    54    68
##   14    22    32    24    34    39    45

In relative terms and represented in a barplot:

Representative genes per cluster

In this section the marker genes per cluster are obtained. Conserved marker genes between WT and induced cells (g7) are identified. This information may be useful for helping in cell type annotation. Markers obtained per cluster are stored in an excel file.

Considering the top100 markers (sorted by their log2FC) per cluster, following matrix show the overlapping degreen among clusters. This is useful to assess if any cluster should be agregrated.

{width=“600px”,height=600} As indicated in previous matrix, some pairs of clusters share a notable number of marker genes:

  • clusters 11 and 14 share most of their top markers (80%), these clusters may be be aggregated. This is to be decided after cell type annotation assessment.
  • clusters 0 and 9 share almost have of their top markers (49%). Their aggregation also has to be evaluated after cell type annotation.
  • clusters 1 and 12 share 34% of their top markers. For now, these are to be kept separately.

Data imputation and smoothing with MAGIC

MAGIC (Markov affinity-based graph imputation of cells) is a tool that allows scRNAseq data imputation (MAGIC?). This tool is applied to SCTransform counts. Obtained matrix is further used for visualization of gene expression data (individual genes or signatures - through their mean). Tool allows the user to adjust the diffusion time t which controls the extents the filtering (biological signal vs noise). Authors state that values between 3 and 12 would be the ideal range for most of the datasets. In this case, the default value (t=3) is used which would be the minimum recommended (empyrically). As t increases, more diffusion is applied => we could be removing biological signal. So, this is a conservative decision which seems to be enough based on the smoothed expression values obtained.

Annex: Kdr+ cells

Cells positive for Kdr+ are selected for further pseudobulk analysis. Deeply looking into Kdr expression across clusters (considering imputed and smoothed expression values):

Cells are subset in those cases where they have > 1 raw counts in Kdr gene for further analysis. This subset includes 5954 cells which (scaled) expression is limited to (in terms of clusters):

The distribution of Kdr+ cells among samples is the following. This is useful to check that there is no bias among samples.

## 
## g7_s1 g7_s2 g7_s3 WT_s1 WT_s2 WT_s3 
##  1029   907   920  1102   928  1068

Session Information

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Madrid
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3    openxlsx_4.2.5.2      ComplexHeatmap_2.16.0
##  [4] metap_1.10            multtest_2.56.0       Biobase_2.60.0       
##  [7] BiocGenerics_0.46.0   clustree_0.5.1        ggraph_2.1.0         
## [10] glmGamPoi_1.12.2      Seurat_5.0.0          SeuratObject_5.0.0   
## [13] sp_2.1-1              ggpubr_0.6.0          stringr_1.5.0        
## [16] dplyr_1.1.4           data.table_1.14.8     ggplot2_3.4.4        
## [19] knitr_1.44           
## 
## loaded via a namespace (and not attached):
##   [1] RcppAnnoy_0.0.21            splines_4.3.1              
##   [3] later_1.3.1                 bitops_1.0-7               
##   [5] tibble_3.2.1                polyclip_1.10-4            
##   [7] fastDummies_1.7.3           lifecycle_1.0.4            
##   [9] Rdpack_2.6                  rstatix_0.7.2              
##  [11] doParallel_1.0.17           globals_0.16.2             
##  [13] lattice_0.21-8              MASS_7.3-60                
##  [15] backports_1.4.1             magrittr_2.0.3             
##  [17] plotly_4.10.2               sass_0.4.7                 
##  [19] rmarkdown_2.24              plotrix_3.8-2              
##  [21] jquerylib_0.1.4             yaml_2.3.7                 
##  [23] qqconf_1.3.2                httpuv_1.6.11              
##  [25] sn_2.1.1                    sctransform_0.4.1          
##  [27] zip_2.3.0                   spam_2.10-0                
##  [29] spatstat.sparse_3.0-3       reticulate_1.34.0          
##  [31] cowplot_1.1.1               pbapply_1.7-2              
##  [33] multcomp_1.4-25             abind_1.4-5                
##  [35] zlibbioc_1.46.0             Rtsne_0.16                 
##  [37] GenomicRanges_1.52.0        purrr_1.0.2                
##  [39] RCurl_1.98-1.12             TH.data_1.1-2              
##  [41] sandwich_3.1-0              tweenr_2.0.2               
##  [43] circlize_0.4.15             GenomeInfoDbData_1.2.10    
##  [45] IRanges_2.34.1              S4Vectors_0.38.1           
##  [47] ggrepel_0.9.5               irlba_2.3.5.1              
##  [49] listenv_0.9.0               spatstat.utils_3.0-5       
##  [51] TFisher_0.2.0               goftest_1.2-3              
##  [53] RSpectra_0.16-1             spatstat.random_3.2-1      
##  [55] fitdistrplus_1.1-11         parallelly_1.36.0          
##  [57] leiden_0.4.3                codetools_0.2-19           
##  [59] DelayedArray_0.26.7         ggforce_0.4.1              
##  [61] shape_1.4.6                 tidyselect_1.2.0           
##  [63] farver_2.1.1                viridis_0.6.4              
##  [65] matrixStats_1.0.0           stats4_4.3.1               
##  [67] spatstat.explore_3.2-5      mathjaxr_1.6-0             
##  [69] jsonlite_1.8.8              GetoptLong_1.0.5           
##  [71] ellipsis_0.3.2              tidygraph_1.2.3            
##  [73] progressr_0.14.0            iterators_1.0.14           
##  [75] ggridges_0.5.4              survival_3.5-5             
##  [77] foreach_1.5.2               tools_4.3.1                
##  [79] ica_1.0-3                   Rcpp_1.0.11                
##  [81] glue_1.6.2                  mnormt_2.1.1               
##  [83] gridExtra_2.3               xfun_0.40                  
##  [85] MatrixGenerics_1.12.3       GenomeInfoDb_1.36.3        
##  [87] numDeriv_2016.8-1.1         withr_2.5.2                
##  [89] fastmap_1.1.1               fansi_1.0.4                
##  [91] digest_0.6.33               R6_2.5.1                   
##  [93] mime_0.12                   colorspace_2.1-0           
##  [95] scattermore_1.2             tensor_1.5                 
##  [97] spatstat.data_3.0-3         utf8_1.2.3                 
##  [99] tidyr_1.3.0                 generics_0.1.3             
## [101] graphlayouts_1.0.0          httr_1.4.7                 
## [103] htmlwidgets_1.6.2           S4Arrays_1.2.0             
## [105] uwot_0.1.16                 pkgconfig_2.0.3            
## [107] gtable_0.3.4                lmtest_0.9-40              
## [109] XVector_0.40.0              htmltools_0.5.6            
## [111] carData_3.0-5               dotCall64_1.1-0            
## [113] clue_0.3-64                 scales_1.3.0               
## [115] png_0.1-8                   rstudioapi_0.15.0          
## [117] rjson_0.2.21                reshape2_1.4.4             
## [119] checkmate_2.2.0             nlme_3.1-162               
## [121] GlobalOptions_0.1.2         cachem_1.0.8               
## [123] zoo_1.8-12                  KernSmooth_2.23-21         
## [125] vipor_0.4.7                 parallel_4.3.1             
## [127] miniUI_0.1.1.1              ggrastr_1.0.2              
## [129] pillar_1.9.0                vctrs_0.6.5                
## [131] RANN_2.6.1                  promises_1.2.1             
## [133] car_3.1-2                   xtable_1.8-4               
## [135] cluster_2.1.4               beeswarm_0.4.0             
## [137] evaluate_0.21               mvtnorm_1.2-3              
## [139] cli_3.6.2                   compiler_4.3.1             
## [141] rlang_1.1.2                 crayon_1.5.2               
## [143] mutoss_0.1-13               future.apply_1.11.0        
## [145] ggsignif_0.6.4              labeling_0.4.3             
## [147] ggbeeswarm_0.7.2            plyr_1.8.8                 
## [149] stringi_1.7.12              viridisLite_0.4.2          
## [151] deldir_1.0-9                munsell_0.5.0              
## [153] lazyeval_0.2.2              spatstat.geom_3.2-7        
## [155] Matrix_1.6-1                RcppHNSW_0.5.0             
## [157] patchwork_1.2.0             future_1.33.0              
## [159] shiny_1.7.5                 SummarizedExperiment_1.30.2
## [161] rbibutils_2.2.16            ROCR_1.0-11                
## [163] igraph_2.0.3                broom_1.0.5                
## [165] bslib_0.5.1

References

Ahlmann-Eltze, Constantin, and Wolfgang Huber. 2021. glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data.” Bioinformatics 36 (24): 5701–2. https://doi.org/10.1093/bioinformatics/btaa1009.
Hao, Yuhan, Tim Stuart, Madeline H Kowalski, Saket Choudhary, Paul Hoffman, Austin Hartman, Avi Srivastava, et al. 2023. “Dictionary Learning for Integrative, Multimodal and Scalable Single-Cell Analysis.” Nature Biotechnology. https://doi.org/10.1038/s41587-023-01767-y.
Tirosh, Itay, Benjamin Izar, Sanjay M Prakadan, Marc H 2nd Wadsworth, Daniel Treacy, John J Trombetta, Asaf Rotem, et al. 2016. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science (New York, N.Y.) 352 (6282): 189–96. https://doi.org/10.1126/science.aad0501.